Abstract

This paper explores the kinds of 
probabilistic relations that are im- 
portant in syntactic disambiguation. 
It proposes that two widely used 
kinds of relations, lexical dependen- 
cies and structural relations, have 
complementary disambiguation ca- 
pabilities. It presents a new model 
based on structural relations, the 
Tree-gram model, and reports exper- 
iments showing that structural rela- 
tions should benefit from enrichment 
by lexical dependencies. 

1 Introduction 

Head-lexicalization currently pervades the
parsing literature, e.g. (Eisner, 1996; Collins,
1997; Charniak, 1999). This method extends
every treebank nonterminal with its head- 
word: the model is trained on this head lexical- 
ized treebank. Head lexicalized models extract 
probabilistic relations between pairs of lexical- 
ized nonterminals ("bilexical dependencies"): 
every relation is between a parent node and 
one of its children in a parse-tree. Bilexical 
dependencies generate parse-trees for input 
sentences via Markov processes that generate 
Context-Free Grammar (CFG) rules (hence 
Markov Grammar (Charniak, 1999)).

Relative to Stochastic CFGs (SCFGs), 
bilexical dependency models exhibit good 
performance. However, bilexical dependen- 
cies capture many but not all relations be- 
tween words that are crucial for syntactic 
disambiguation. We give three examples of 
kinds of relations not captured by bilexical
dependencies. Firstly, relations between non-head
words of phrases, e.g. the relation between
"more" and "than" in "more apples than oranges",
or problems of PP attachment as in "he ate pizza
(with mushrooms) / (with a fork)". Secondly,
relations between three or more words are, by
definition, beyond bilexical dependencies (e.g.
between "much more" and "than" in "much more
apples than oranges"). Finally, it is unclear how
bilexical dependencies help resolve the ambiguity
of idioms, e.g. "Time flies like an arrow" (neither
does "time" prefer to "fly", nor do the fictitious
beasts "Time flies" have a taste for an "arrow").

The question that imposes itself is, indeed,
what relations might complement bilexical
dependencies? We propose that bilexical
dependencies can be complemented by structural
relations (Scha, 1990), i.e. cooccurrences of
syntactic structures, including actual words. An
example model that employs one version of
structural relations is Data Oriented Parsing
(DOP) (Bod, 1995). DOP's parameters are
"subtrees", i.e. connected subgraphs of parse-trees
that constitute combinations of CFG
rules, including terminal rules.

Formally speaking, "bilexical dependencies"
and "structural relations" define two disjoint
sets of probabilistic relations. Bilexical
dependencies are relations defined over
direct-dominance head-lexicalized nonterminals
(see (Satta, 2000)); in contrast, structural
relations are defined over words and
arbitrary-size syntactic structures (with
non-lexicalized nonterminals). Apart from formal
differences, they also have complementary
advantages. Bilexical dependencies capture
influential lexical relations between heads and
dependents. Hence, all bilexical dependency
probabilities are conditioned on lexical
information, and lexical information is available
at every point in the parse-tree. Structural
relations, in contrast, capture many relations not
captured by bilexical dependencies (e.g. the
examples above). However, structural relations
do not always percolate lexical information
up the parse-tree since their probabilities
are not always lexicalized. This is a serious
disadvantage when parse-trees are generated for
novel input sentences, since e.g. subcat frames
are hypothesized for nodes high in the parse-tree
without reference to their head words.

So, theoretically speaking, bilexical dependencies
and structural relations have complementary
aspects. But what are the empirical
merits and limitations of structural relations?
This paper presents a new model based on
structural relations, the Tree-gram model,
which allows head-driven parsing. It studies
the effect of percolating head categories on
performance and compares the performance
of structural relations to bilexical dependencies.
The comparison is conducted on the Wall
Street Journal (WSJ) corpus (Marcus et al.,
1993). In the remainder, we introduce the
Tree-gram model in section 2, discuss practical
issues in section 3, exhibit and discuss
the results in section 4, and give our
conclusions in section 5.

2 The Tree-gram model 

To observe the effect of percolating information
up the parse-tree on model behavior,
we introduce pre-head enrichment, a structural
variant of head-lexicalization. Given a
training treebank TB, for every non-leaf node
$\mu$ we mark one of its children as the head-child,
i.e. the child that dominates the head-word^1
of the constituent under $\mu$. We then
enrich this treebank by attaching to the label
of every phrasal node (i.e. nonterminal
that is not a POS-tag) a pre-head representing
its head-word. The pre-head of node $\mu$ is
extracted from the constituent parse-tree under
node $\mu$. In this paper, the pre-head of $\mu$
consists of 1) the POS-tag of the head-word
of $\mu$ (called 1st-order pre-heads), and
possibly 2) the label of the mother node of
that POS-tag (called 2nd-order pre-heads).
Pre-heads here also include other information
defined in the sequel, e.g. subcat frames. The
complex categories that result from the enrichment
serve as the nonterminals of our training
treebank; we refer to the original treebank
symbols as "WSJ labels".

2.1 Generative models 

A probabilistic model assigns a probability
to every parse-tree given an input sentence
$S$, thereby distinguishing one parse
$T^* = \arg\max_T P(T|S) = \arg\max_T P(T,S)$.
The probability $P(T,S)$ is usually estimated
from cooccurrence statistics extracted from a
treebank. In generative models, the tree $T$ is
generated in top-down derivations that rewrite
the start symbol TOP into the sentence $S$.
Each rewrite-step involves a "rewrite-rule"
together with its estimated probability. In
the present model, the "rewrite-rules" differ
from the CFG rules and combinations thereof
that can be extracted from the treebank. We
refer to them as Tree-grams (abbreviated
T-grams). T-grams provide a more general form
for Markov Grammar rules (Collins, 1997;
Charniak, 1999) as well as DOP subtrees.
In comparison with DOP subtrees, T-grams
capture more structural relations, allow
head-driven parsing and are easier to combine
with bilexical dependencies.

2.2 T-gram extraction 

Given a parse $T$ from the training treebank,
we extract three disjoint T-gram sets, called
roles, from every one of its non-leaf nodes^2 $\mu$:
the head-role $\mathcal{H}(\mu)$, the left-dependent role
$\mathcal{L}(\mu)$ and the right-dependent role $\mathcal{R}(\mu)$. The
role of a T-gram signifies the T-gram's contribution
to stochastic derivations: $t \in \mathcal{H}$ carries
a head-child of its root node label, $t \in \mathcal{L}$
($t \in \mathcal{R}$) carries left (resp. right) dependents
for other head T-grams that have roots labeled
the same as the root of $t$. As in Markov
Grammars, a head-driven derivation first generates
a head-role T-gram and attaches to it left-
and right-dependent role T-grams. We discuss
these derivations right after we specify the
T-gram extraction procedure.

^1 Head-identification procedure by (Collins, 1997).

^2 Assuming that every node has a unique address.

Figure 1: Constituent under node $\mu$: $d > 1$.

Let $d$ represent the depth^3 of the constituent
tree-structure that is rooted at $\mu$,
$H$ represent the label of the head-child of
$\mu$, and $\Delta$ represent the special stop symbol
that encloses the children of every node
(see figure 1). Also, for convenience, let $\delta_k^n$
be equal to $\Delta$ iff $k = n$ and NILL (i.e.
the empty tree-structure) otherwise. We
specify the extraction for $d = 1$ and for $d > 1$.

When $d = 1$, the label of $\mu$ is a POS-tag
and the subtree under $\mu$ is of the form
$pt \to \Delta w \Delta$, where $w$ is a word. In this case
$\mathcal{H}(\mu) = \{pt \to \Delta w \Delta\}$ and $\mathcal{L}(\mu) = \mathcal{R}(\mu) = \emptyset$.

When $d > 1$, the subtree under $\mu$ has the form
$A \to \Delta L_n(t_n^l) \ldots L_1(t_1^l) \, H(t_H) \, R_1(t_1^r) \ldots R_m(t_m^r) \Delta$
(figure 1), where every $t_j^l$, $t_j^r$ and $t_H$ is the
subtree dominated by the child node of $\mu$
(labeled respectively $L_j$, $R_j$ or $H$) whose address
we denote respectively with $child_L(\mu, j)$,
$child_R(\mu, j)$ and $child_H(\mu)$. We extract three
sets of T-grams from $\mu$:

$\mathcal{H}(\mu)$: contains, for all $1 \leq i \leq n$ and $1 \leq j \leq m$,
$A \to \delta_i^n L_i(X_i^l) \ldots H(X_h) \ldots R_j(X_j^r) \delta_j^m$,
where $X_h$ is either in $\mathcal{H}(child_H(\mu))$ or
NILL, and every $X_z^l$ (resp. $X_z^r$) is either
a T-gram from $\mathcal{H}(child_L(\mu, z))$ (resp.
$\mathcal{H}(child_R(\mu, z))$) or NILL.

$\mathcal{L}(\mu)$: contains $A \to \delta_k^n L_k(X_k) \ldots L_i(X_i)$, for
all $1 \leq i \leq k \leq n$, where every $X_z$,
$i \leq z \leq k$, is either a T-gram from
$\mathcal{H}(child_L(\mu, z))$ or NILL.

$\mathcal{R}(\mu)$: contains $A \to R_i(X_i) \ldots R_k(X_k) \delta_k^m$,
for all $1 \leq i \leq k \leq m$, where every $X_z$,
$i \leq z \leq k$, is either a T-gram from
$\mathcal{H}(child_R(\mu, z))$ or NILL.

^3 The depth of a (sub)tree is the number of edges in
the longest path from its root to a leaf node.
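The index ranges in the definition of the left-dependent role can be made concrete with a minimal sketch. It assumes every $X_z$ is NILL, so it only lists the depth-one "skeletons" of the left-dependent T-grams of a single node; the function name, the `STOP` marker standing in for $\Delta$, and the list encoding of children are illustrative assumptions, not part of the original model description.

```python
STOP = "STOP"  # stand-in for the stop symbol Delta (naming is an assumption)


def left_role_skeletons(parent, left_children):
    """Enumerate skeleton left-dependent T-grams of one node.

    left_children[0] is L_1 (adjacent to the head-child) and
    left_children[-1] is L_n (the left-most child).
    Every X_z is taken to be NILL, so each result is just the root
    label plus a contiguous run L_k ... L_i with 1 <= i <= k <= n.
    """
    n = len(left_children)
    skeletons = []
    for i in range(1, n + 1):
        for k in range(i, n + 1):
            # Children in surface order: L_k (left-most) down to L_i.
            seq = [left_children[z - 1] for z in range(k, i - 1, -1)]
            if k == n:
                # delta_k^n equals Delta iff k == n: the run reaches
                # the left edge, so the stop symbol closes it.
                seq.insert(0, STOP)
            skeletons.append((parent, tuple(seq)))
    return skeletons


if __name__ == "__main__":
    for root, children in left_role_skeletons("S", ["NP", "ADVP"]):
        print(root, "->", " ".join(children))
```

A node with $n$ left dependents yields $n(n+1)/2$ such skeletons; the right-dependent role is the mirror image, with the stop symbol appended on the right when $k = m$.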



Figure 2: An example parse-tree.

Note that every T-gram's non-root and non-leaf
node dominates a head-role T-gram (specified
by $\mathcal{H}(child \cdots)$).

A non-leaf node labeled by nonterminal
$A$ is called complete, denoted "[A]", iff $\Delta$ delimits
its sequence of children from both sides;
when $\Delta$ is to the left (right) of the children of
the node, the node is called left (resp. right)
complete, denoted "[A" (resp. "A]"). When
$\mu$ is not left (right) complete it is open from
the left (resp. right); when $\mu$ is left and right
open, it is called open.

Figure 2 exhibits a parse-tree^4: the number
of the head-child of a node is specified between
brackets. Figure 3 shows some of the T-grams
that can be extracted from this tree.

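Having extracted T-grams from all
non-leaf nodes of the treebank, we obtain
$\mathcal{H} = \bigcup_{\mu \in TB} \mathcal{H}(\mu)$, $\mathcal{L} = \bigcup_{\mu \in TB} \mathcal{L}(\mu)$ and
$\mathcal{R} = \bigcup_{\mu \in TB} \mathcal{R}(\mu)$. $\mathcal{H}_A$, $\mathcal{L}_A$ and $\mathcal{R}_A$ represent
the subsets of resp. $\mathcal{H}$, $\mathcal{L}$ and $\mathcal{R}$ that contain
those T-grams that have roots labeled $A$.
$X_A(B) \in \{\mathcal{L}_A(B), \mathcal{R}_A(B), \mathcal{H}_A(B)\}$ specifies
that the extraction took place on some
treebank $B$ other than the training treebank.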

2.3 T-gram generative processes 

Now we specify T-gram derivations assuming
that we have an estimate of the probability of
a T-gram. We return to this issue right after
this. A stochastic derivation starts from the
start nonterminal TOP. TOP is a single-node
partial parse-tree which is simultaneously the
root and the only leaf node. A derivation
terminates when two conditions are met: (1) every
non-leaf node in the generated parse-tree
is complete (i.e. $\Delta$ delimits its children from
both sides) and (2) all leaf nodes are labeled
with terminal symbols.

^4 Pre-heads are omitted for readability.

Figure 3: Some T-grams extracted from the tree in figure 2: the superscript on the root label specifies the
T-gram role, e.g. the left-most T-gram is in the left-dependent role. Non-leaf nodes are marked with "[" and
"]" to specify whether they are complete from the left/right or both (leaving open nodes unmarked).

Let $\Pi$ represent the
current partial parse-tree, i.e. the result of the
preceding generation steps, and let $C_\Pi$ represent
that part of $\Pi$ that influences the choice
of the next step, i.e. the conditioning history.
The generation process repeats the following
steps in some order, e.g. head-left-right:

Head-generation: Select a leaf node $\mu$ labeled
by a nonterminal $A$, and let $A$ generate
a head T-gram $t \in \mathcal{H}_A$ with probability
$P_H(t|A, C_\Pi)$. This results in a partial
parse-tree that extends $\Pi$ at $\mu$ with a copy
of $t$ (as in CFGs and in DOP).

Modification: Select from $\Pi$ a non-leaf
node $\mu$ that is not complete. Let $A$ be the
label of $\mu$ and $T = A \to X_1(x_1) \cdots X_b(x_b)$ be
the tree dominated by $\mu$ (see figure 4):

Left: if $\mu$ is not left-complete, let $\mu$ generate
to the left of $T$ a left-dependent
T-gram $t = A \to L_1(l_1) \cdots L_a(l_a)$
from $\mathcal{L}_A$ with probability $P_L(t|A, C_\Pi)$
(see figure 4 (L)); this results in
a partial parse-tree that is obtained
by replacing $T$ in $\Pi$ with
$A \to L_1(l_1) \cdots L_a(l_a) X_1(x_1) \cdots X_b(x_b)$.

Right: this is the mirror case (see figure 4 (R)).
The generation probability is $P_R(t|A, C_\Pi)$.

Figure 5 shows a derivation using T-grams
(e), (a) and (d) from figure 3 applied to
T-gram TOP → S. Note that each derivation-step
probability is conditioned on $A$, the label
of the node in $\Pi$ where the current rewriting
is taking place, on the role ($\mathcal{H}$, $\mathcal{L}$ or $\mathcal{R}$) of
the T-gram involved, and on the relevant history
$C_\Pi$. Assuming beyond this that stochastic
independence between the various derivation
steps holds, the probability of a derivation
is equal to the product of the conditional
probabilities of the individual rewrite steps.
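Written out, the independence assumption gives the following product. The step index $i$, the number of steps $k$, and the subscripted symbols $t_i$, $A_i$, $X_i$ and $\Pi_{i-1}$ are notation introduced here for illustration only; they do not appear in the original text.

$$P(der, S) = \prod_{i=1}^{k} P_{X_i}(t_i \mid A_i, C_{\Pi_{i-1}}), \qquad X_i \in \{H, L, R\}$$

Here $t_i$ is the T-gram generated at step $i$, $A_i$ is the label of the node being rewritten, $X_i$ is the role of $t_i$, and $\Pi_{i-1}$ is the partial parse-tree built by the preceding $i-1$ steps (with $\Pi_0$ the single-node tree TOP).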

Unlike SCFGs and Markov grammars but
like DOP, a parse-tree may be generated
via different derivations. The probability
of a parse-tree $T$ is equal to the
sum of the probabilities of the derivations
that generate it (denoted $der \Rightarrow T$), i.e.
$P(T, S) = \sum_{der \Rightarrow T} P(der, S)$. However,
because computing $\arg\max_T P(T, S)$ cannot
be achieved in deterministic polynomial time
(Sima'an, 1996), we apply estimation methods
that allow tractable parsing.
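A toy illustration may help separate the two quantities involved here: the probability of a parse-tree (a sum over its derivations) and the probability of a single derivation (a product over its steps). The sketch below uses invented step probabilities and tree names purely for illustration; it is not output of the model, and it enumerates derivations explicitly, which a real parser would not do.

```python
import math
from collections import defaultdict

# Each entry: (parse-tree id, probabilities of the rewrite steps of ONE
# derivation of that tree). All numbers are made up for illustration.
DERIVATIONS = [
    ("T1", [0.5, 0.4]),    # only derivation of T1: 0.2
    ("T2", [0.5, 0.3]),    # first derivation of T2: 0.15
    ("T2", [0.25, 0.5]),   # second derivation of T2: 0.125
]


def derivation_prob(steps):
    """Product of the conditional probabilities of the rewrite steps."""
    return math.prod(steps)


def tree_probs(derivations):
    """P(T, S): sum of the probabilities of all derivations of T."""
    totals = defaultdict(float)
    for tree, steps in derivations:
        totals[tree] += derivation_prob(steps)
    return dict(totals)


def most_probable_derivation_tree(derivations):
    """Tree generated by the single most probable derivation (MPD)."""
    return max(derivations, key=lambda d: derivation_prob(d[1]))[0]


def most_probable_tree(derivations):
    """Tree with the highest summed probability."""
    totals = tree_probs(derivations)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    print(tree_probs(DERIVATIONS))
    print("MPD tree:", most_probable_derivation_tree(DERIVATIONS))
    print("most probable tree:", most_probable_tree(DERIVATIONS))
```

In this toy case the two criteria disagree: the best single derivation belongs to T1, while T2 has the larger summed probability. This is the gap that makes the choice of estimator matter once parsing is restricted to a tractable criterion.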

2.4 Estimating T-gram probabilities 

Let $count(Y_1, \cdots, Y_m)$ represent the occurrence
count for joint event $(Y_1 \cdots Y_m)$ in the training
treebank. Consider a T-gram $t \in X_A$,
$X_A \in \{\mathcal{L}_A, \mathcal{R}_A, \mathcal{H}_A\}$, and a conditioning
history $C_\Pi$. The estimate

$$\frac{count(t, X_A, C_\Pi)}{\sum_{x \in X_A} count(x, X_A, C_\Pi)}$$

assumes no hidden elements (different derivations
per parse-tree), i.e. it estimates the probability
$P_X(t|A, C_\Pi)$ directly from the treebank
trees (henceforth direct-estimate). This
estimate is employed in DOP and is not
Maximum-Likelihood (Bonnema et al., 1999).
We argue that the bias of the direct estimate
allows approximating the preferred parse
by the one generated by the Most Probable
Derivation (MPD). This is beyond the scope
of this paper and will be discussed elsewhere.

Figure 4: T-gram $t$ is generated at $\mu$: (L) $t \in \mathcal{L}_A$, (R) $t \in \mathcal{R}_A$.

Figure 5: A T-gram derivation: the rewriting of TOP is not shown. An arrow marks the node where rewriting
takes place. Following the arrows: 1. A left T-gram with root [S is generated at node S]: S is complete. 2. A
head-role T-gram is generated at node NP: all nodes are either complete or labeled with terminals.
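The direct estimate of section 2.4 is a relative frequency, which can be sketched in a few lines. The sketch assumes each extracted T-gram occurrence has already been reduced to a tuple `(tgram, role, root_label, history)`; this tuple encoding and the function name are assumptions made for illustration, not the original implementation.

```python
from collections import Counter


def direct_estimate(events):
    """Relative-frequency estimate of P_X(t | A, C).

    events: iterable of (tgram, role, root_label, history) tuples, one
    per T-gram occurrence extracted from the treebank (hypothetical
    encoding). Returns a dict mapping each distinct event to
    count(t, X_A, C) / sum over x in X_A of count(x, X_A, C).
    """
    events = list(events)
    joint = Counter(events)
    # Denominator: all T-grams sharing role, root label and history.
    context = Counter((role, root, hist) for _, role, root, hist in events)
    return {event: n / context[event[1:]] for event, n in joint.items()}


if __name__ == "__main__":
    toy = [
        ("t1", "L", "S", "h"),
        ("t1", "L", "S", "h"),
        ("t2", "L", "S", "h"),
        ("t3", "H", "S", "h"),
    ]
    for event, p in direct_estimate(toy).items():
        print(event, round(p, 3))
```

No smoothing is applied, matching the statement in section 3 that the relative frequencies are used unsmoothed.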

2.5 WSJ model instance 

Up till now $C_\Pi$ represented conditioning
information anonymously in our model.
For the WSJ corpus, we instantiate $C_\Pi$
as follows:

1. Adjacency: The flag $F_L(t)$ ($F_R(t)$) tells
whether a left-dependent (right-dependent)
T-gram $t$ extracted from some
node $\mu$ dominates a surface string that
is adjacent to the head-word of $\mu$ (detail
in (Collins, 1997)).

2. Subcat-frames: (Collins, 1997) subcat frames
are adapted: with every node $\mu$ that dominates
a rule $A \to \Delta L_n \ldots L_1 \, H \, R_1 \ldots R_m \Delta$ in the
treebank (figure 1), we associate two (possibly
empty) multisets of complements: $SC_L^\mu$ and
$SC_R^\mu$. Every complement in $SC_L^\mu$ ($SC_R^\mu$)
represents some left (right) complement-child
of $\mu$. This changes T-gram extraction as follows:
with every non-leaf node in a T-gram
that is extracted from a tree in this enriched
treebank we now have a left and a right
subcat frame associated. Consider the root
node $x$ in a T-gram extracted from node $\mu$
and let the children of $x$ be $Y_1 \cdots Y_f$ (a
subsequence of $\Delta L_n, \cdots, H, \cdots, R_m \Delta$). The left
(right) subcat frame of $x$ is subsumed by $SC_L^\mu$
(resp. $SC_R^\mu$) and contains those complements
that correspond to the left-dependent (resp.
right-dependent) children of $\mu$ that are not
among $Y_1 \cdots Y_f$. Tree-gram derivations are
modified accordingly: whenever a T-gram is
generated (together with the subcat frames
of its nodes) from some node $\mu$ in a partial
tree, the complements that its root dominates
are removed from the subcat frames of $\mu$.
Figure 6 shows a small example of a derivation.

3. Markovian generation: When node
$\mu$ has empty subcat frames, we assume
1st-order Markov processes in generating both $\mathcal{L}$
and $\mathcal{R}$ T-grams around its $\mathcal{H}$ T-gram: $LM^\mu$
and $RM^\mu$ denote resp. the left- and right-most
children of node $\mu$. Let $XRM^\mu$ and
$XLM^\mu$ be equal to resp. $RM^\mu$ and $LM^\mu$ if
the name of the T-gram system contains the
word +Markov (otherwise they are empty).

Figure 6: $S]^{\{NP\}_L}$ is a (left-open right-complete) node labeled S with a left subcat frame containing an NP.
After the first rewriting, the subcat frame becomes empty since the NP complement was generated, resulting in
$[S]^{\{\}_L}$. The other subcat frames are empty and are not shown here.

Let $\mu$, labeled $A$, be the node where the
current rewrite-step takes place, $P$ be the
WSJ-label of the parent of $\mu$, and $H$ the
WSJ-label of the head-child of $\mu$. Our
probabilities are:
$P_H(t|A, C_\Pi) \approx P_H(t|A, P)$,
$P_L(t|A, C_\Pi) \approx P_L(t|A, H, SC_L, F_L(t), XRM^\mu)$,
$P_R(t|A, C_\Pi) \approx P_R(t|A, H, SC_R, F_R(t), XLM^\mu)$.



3 Implementation issues 



Sections 02-21 of the WSJ Penn Treebank (Marcus
et al., 1993) (release 2) are used for
training and section 23 is held out for testing
(we tune on section 24). The parser
output is evaluated by "evalb"^5 on the PARSEVAL
measures (Black et al., 1991), comparing
a proposed parse $P$ with the corresponding
treebank parse $T$ on Labeled Recall
($LR = \frac{\text{number of correct constituents in } P}{\text{number of constituents in } T}$),
Labeled Precision
($LP = \frac{\text{number of correct constituents in } P}{\text{number of constituents in } P}$)
and Crossing Brackets (CB = number of constituents
in $P$ that violate constituent boundaries
in $T$).
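The three measures can be sketched directly from their definitions. The sketch assumes a parse is given as a list of `(label, start, end)` constituents over word positions; this encoding and the function name are assumptions, and the sketch ignores the additional conventions applied by the actual evalb program (for example its treatment of punctuation), so it is not a replacement for it.

```python
from collections import Counter


def parseval(proposed, gold):
    """Labeled recall, labeled precision and crossing brackets.

    proposed, gold: lists of (label, start, end) constituents, with
    `end` exclusive (hypothetical encoding).
    """
    p_counts, g_counts = Counter(proposed), Counter(gold)
    # Multiset intersection: a repeated constituent is matched at most
    # as many times as it occurs in both parses.
    correct = sum((p_counts & g_counts).values())
    lr = correct / sum(g_counts.values())
    lp = correct / sum(p_counts.values())

    gold_spans = {(s, e) for _, s, e in gold}

    def crosses(s, e):
        # Two spans cross when they overlap without nesting.
        return any(s < gs < e < ge or gs < s < ge < e
                   for gs, ge in gold_spans)

    cb = sum(1 for _, s, e in proposed if crosses(s, e))
    return lr, lp, cb


if __name__ == "__main__":
    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
    proposed = [("S", 0, 5), ("NP", 0, 3), ("VP", 3, 5)]
    print(parseval(proposed, gold))
```

In the usage example only the S constituent is correct, and the proposed NP over positions 0-3 crosses the gold VP over positions 2-5.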

T-gram extraction: The number of T-grams
is limited by setting constraints on their form,
much like n-grams. One upper bound is set
on the depth^6 ($d$), a second on the number
of children of every node ($b$), a third on the
sum of the number of nonterminal leaves and
the number of (left/right) open nodes ($n$), and
a fourth ($w$) on the number of words in a
T-gram. Also, a threshold is set on the frequency
($f$) of the T-gram. In the experiments
$n \leq 4$, $w \leq 3$ and $f \geq 5$ are fixed while $d$
changes.

^5 http://www.research.att.com/~mcollins/.

^6 T-gram depth is the length of the longest path in
the tree obtained by right/left-linearization of the
T-gram around the T-gram nodes' head-children.

Unknown words and smoothing: We
did not smooth the relative frequencies. Similar
to (Collins, 1997), every word occurring
less than 5 times in the training set was renamed
to CAP+UNKNOWN+SUFF, where
CAP is 1 if its first letter is capitalized and
0 otherwise, and SUFF is its suffix. Unknown
words in the input are renamed this way before
parsing starts.

Tagging and parsing: An
input word is tagged with all POS-tags with
which it cooccurred in the training treebank.
The parser is a two-pass CKY parser: the first
pass employs T-grams that fulfill $d = 1$ in
order to keep the parse-space under control
before the second pass employs the full
Tree-gram model for selecting the MPD.
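The renaming of rare and unknown words described above can be sketched as follows. The text does not say how the suffix is determined, so taking the last two characters (`suffix_len=2`) is purely an assumption, as are the function names; only the frequency threshold of 5 and the CAP+UNKNOWN+SUFF pattern come from the description.

```python
from collections import Counter


def make_renamer(training_tokens, min_count=5, suffix_len=2):
    """Build a function that renames rare or unseen words.

    min_count=5 follows the text; suffix_len is a hypothetical choice,
    since the original suffix definition is not given.
    """
    counts = Counter(training_tokens)

    def rename(word):
        if counts[word] >= min_count:
            return word  # frequent training word: keep as is
        cap = "1" if word[:1].isupper() else "0"
        return f"{cap}+UNKNOWN+{word[-suffix_len:]}"

    return rename


if __name__ == "__main__":
    train = ["the"] * 5 + ["Zorg"]
    rename = make_renamer(train)
    # Words seen fewer than 5 times, or never, receive the same treatment.
    print([rename(w) for w in ["the", "Zorg", "walking"]])
```

The same function is applied to the training treebank and to input sentences before parsing, so that test-time unknown words map onto classes that were observed during training.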

4 Empirical results 

First we review the lexical conditionings in
previous work (other important conditionings
are not discussed for space reasons).
Magerman95 (Magerman, 1995; Jelinek et
al., 1994) grows a decision-tree to estimate
$P(T|S)$ through a history-based approach
which conditions on actual words. Charniak
(Charniak, 1997) presents lexicalizations
of SCFGs: the Minimal model conditions
SCFG rule generation on the head-word of its
left-hand side, while Charniak97 further conditions
the generation of every constituent's
head-word on the head-word of its parent
constituent, effectively using bilexical dependencies.
Collins97 (Collins, 1997) uses a bilexicalized
0th-order Markov Grammar: a lexicalized
CFG rule is generated by projecting the
head-child first, followed by every left and right
dependent, conditioning these steps on the
head-word of the constituent. Collins97 extends
this scheme to deal with subcat frames,
adjacency, traces and wh-movement. Charniak99
conditions lexically as Collins does but
also exploits up to 3rd-order Markov processes
for generating dependents. Except for T-grams
and SCFGs, all systems smooth the relative
frequencies with much care.

System                                  LR%   LP%   CB    0CB%  2CB%

Minimal (Charniak, 1997)                83.4  84.1  1.40  53.2  79.0
Magerman95 (Magerman, 1995)             84.6  84.9  1.26  56.6  81.4
Charniak97 (Charniak, 1997)             87.5  87.4  1.00  62.1  86.1
Collins97 (Collins, 1997)               88.1  88.6  0.91  66.4  86.9
Charniak99 (Charniak, 1999)             90.1  90.1  0.74  70.1  89.6

SCFG (Charniak, 1997)                   71.7  75.8  2.03  39.5  68.1
T-gram ($d \leq 5$, 2nd-order pre-heads)  82.9  85.1  1.30  58.0  82.1

Table 1: Various results on WSJ section 23 sentences $\leq 40$ words (2245 sentences).

Sentences $\leq 40$ words (including punctuation)
in section 23 were parsed by various
T-gram systems. Table 1 shows the results of
some systems including ours. Systems conditioning
mostly on lexical information are contrasted
with SCFGs and T-grams. Our result
shows that T-grams improve on SCFGs but
fall short of the best lexical-dependency systems.
Being 10-12% better than SCFGs, comparable
with the Minimal model and Magerman95,
and about 7.0% worse than the best
system, it is fair to say that (depth 5) T-grams
perform more like bilexicalized dependency
systems than bare SCFGs.

Table 2 exhibits results of various T-gram
systems. Columns 1-2 exhibit the traditional
DOP observation about the effect of
the size of subtrees/T-grams on performance.
Columns 3-5 are more interesting: they show
that even when T-gram size is kept fixed,
systems that are pre-head enriched improve
on systems that are not pre-head enriched
(0th-order pre-heads). This is supported by the
result of column 1 in contrast to SCFG and
Collins97 (table 1): the D1 T-gram system differs from
Collins97 almost only in pre-head vs. head
enrichment and indeed performs midway between
SCFG and Collins97. This all suggests
that allowing bilexical dependencies in
T-gram models should improve performance.
It is noteworthy that pre-head enriched systems
are also more efficient in time and space.
Column 6 shows that adding Markovian
conditioning to subcat frames further improves
performance, suggesting that further study of
the conditional probabilities of dependent
T-grams is necessary.

Now for any node in
a gold / proposed parse, let node-height be
the average path-length to a word dominated
by that node. We set a threshold on node-height
in the gold and proposed parses and observe
performance. Figure 7 plots the F-score
$= (2 \cdot LP \cdot LR)/(LP + LR)$ against the node-height
threshold. Clearly, performance degrades as
the nodes get further from the words, while
pre-heads improve performance.
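Node-height and the F-score used in this analysis can be sketched as follows. The nested-tuple tree encoding `(label, child, ...)` with words as plain strings, and the function names, are assumptions made for illustration; path length is counted in edges from the node down to each word it dominates.

```python
def leaf_depths(tree):
    """Path lengths (in edges) from the root of `tree` to each word."""
    if isinstance(tree, str):  # a word: distance 0 from itself
        return [0]
    # tree[0] is the label, tree[1:] are the children.
    return [d + 1 for child in tree[1:] for d in leaf_depths(child)]


def node_height(tree):
    """Average path length from this node to the words it dominates."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


def f_score(lp, lr):
    """F-score = (2 * LP * LR) / (LP + LR)."""
    return 2 * lp * lr / (lp + lr)


if __name__ == "__main__":
    np_node = ("NP", ("DT", "a"), ("NN", "deal"))
    s_node = ("S", np_node, ("VP", ("VBN", "sealed")))
    print(node_height(("DT", "a")), node_height(np_node), node_height(s_node))
    print(f_score(0.5, 1.0))
```

Under this encoding a POS-tag node has height 1, so thresholding on node-height selects constituents by their distance from the words, which is the quantity varied along the horizontal axis of Figure 7.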




Figure 7: Higher nodes are harder (horizontal axis: height threshold, 2 to 13).
Column   1     2     3     4     5     6

CB       1.70  1.32  1.44  1.43  1.48  1.30

#sens    2245, first 1000, 2245

Table 2: Results of various systems: ($d \leq i$), i-th order pre-heads (pre-head length is $i$), +Markov (1st-order
Markov conditioning on nodes with empty subcat frames for generating L and R T-grams).



5 Conclusions 

We started this paper wondering about the
merits of structural relations. We presented
the T-gram model and exhibited empirical
evidence for the usefulness as well as the
shortcomings of structural relations. We also
provided evidence for the gains from enrichment
of structural relations with semi-lexical
information. In our quest for better modeling, we
still need to explore how structural relations
and bilexical dependencies can be combined.
Probability estimation, smoothing and efficient
implementations need special attention.